DOCUMENT RESUME

ED 036 870    24    CG 00 5 1

AUTHOR          Baker, Eva L.
TITLE           The Effects of Manipulated Item Writing Constraints on
                the Homogeneity of Test Items. Center for the Study of
                Evaluation Reprint Series No. 11.
INSTITUTION     California Univ., Los Angeles. Center for the Study
                of Evaluation.
SPONS AGENCY    Office of Education (DHEW), Washington, D.C. Bureau
                of Research.
BUREAU NO       BR-6-1646
PUB DATE        Mar 70
CONTRACT        OEC-4-6-061646-1909
NOTE            12p.; Paper presented at National Council for
                Measurement in Education Conference, Minneapolis,
                Minnesota, March, 1970

EDRS PRICE      EDRS Price MF-$0.25 HC-$0.70
DESCRIPTORS     *Behavioral Objectives, Educational Objectives,
                *Item Analysis, Objectives, *Research Methodology,
                *Test Construction, Testing, *Tests, Test Selection



ABSTRACT 

To help teachers who must produce test items to
measure instructional objectives, 54 teacher education candidates
participated in an experiment where easily understood constraints on
item production were manipulated. Four forms of a test item writing
exercise sheet were randomly distributed, each asking for the
production of eight sample test items, two for each specified topic.
The subjects produced 16 items, to be used for seventh grade
students. Two 16 item tests were constituted, one on subtraction and
one on current events. The tests were administered to 51 junior high
school students. Means and standard deviations of the items were
computed, and analysis of variance for the subtest means was
conducted for each replication. Significant differences (F=8.3, df=3,
12) were observed for subtraction. For the current events data
differences were not significant. Findings are limited by the number
of items on each subtest. Further staff studies are investigating how
to produce items truly congruent with objectives and how best to
translate these findings into practical procedures for teachers.
(Author/CJ)







Marvin C. Alkin 
Director 



UCLA Graduate School of Education 

The CENTER FOR THE STUDY OF EVALUATION (CSE) is one of 
nine centers for educational research and development, sponsored 
by the United States Department of Health, Education, and Welfare, 
Office of Education. Established at UCLA in June, 1966, CSE is de- 
voted exclusively to finding new theories and methods of analyzing 
educational systems and programs and gauging their effects. 

The Center serves its unique function with an interdisciplinary 
staff whose specialties combine for a broad, versatile approach to the 
complex problems of evaluation. Study projects are conducted in three 
major program areas: Evaluation of Instructional Programs, Evaluation 
of Educational Systems, and Evaluation Methodology and Services. 
The greatest curse, wise men have said, may be to have your wishes come true.



A case in point is the advocacy of objectives-based instruction and
evaluation, where teachers test, teach, and retest children until desired
levels of mastery are reached. The tests used in this type of instruction
differ from commercially produced achievement tests because they are directed
toward specific program goals, usually stated in operational language. Program
planning and budgeting systems are expanding the appeal of such approaches,
and the call for objectives and items has increased. While a fledgling
institution has emerged to bear part of the burden for generating some of the
objectives and items needed for large scale implementation of such an approach,
it has become clear that more items will be demanded than can currently be
prepared. Obviously, if a teacher needs a great number of items for iterative
testing, he will either produce them himself, or go without and revert to a
more usual instructional pattern.

What kind of help can be provided for the teacher who must produce test
items to measure his instructional objectives? Do simple procedures exist
which allow the teacher to produce homogeneous test items? Some clear
alternatives to control item production involve the use of behaviorally
stated objectives, sample test items and simplified item forms.

Improved production of test items has historically been one of the
benefits emphasized by curriculum specialists advocating behavioral objectives.
Broadly stated objectives make the estimate of congruence between objective
and item difficult to determine. For example, if one were asked to produce
items to measure an objective such as "understanding of statistical concepts,"
a great number of items would be considered suitable, and depending upon which
set happened to be used by the instructor, vastly different notions about
student achievement would be inferred. However, if the objective was modified
to "the student would have to select and justify a statistical analysis for
those research designs described by Campbell and Stanley," performance on
an appropriate set of items should give a fairly good idea of the attainment
of the objective. A further way to reduce the heterogeneity of responses to
the items might be to employ a standard format for each item. Additionally,
if the content to be sampled was made more precise, then one would assume
that increased homogeneity would be demonstrated by sets of items measuring
the same objective.

The item form, under development at Minnesota, describes both the format
which the items in a set should take and the content limits which should be
observed. Attention has been directed to variants of this idea both at UCLA
and the Southwest Regional Laboratory for Educational Research and Development.
The Project for Research on Objectives-Based Evaluation (PROBE), a program of
the UCLA Center for the Study of Evaluation, used generation rules for producing
sets of items to accompany objectives for the Instructional Objectives Exchange.
These rules limited the format of the item and defined the content area to be
assessed. However, when teachers were asked to use these generation rules to
produce additional items to measure the objectives, they were appalled by the
difficulty they experienced in deciphering the technical language of the rules.



Method 

To gain a modest amount of additional information, an experiment was
conducted where various easy-to-understand constraints on item production were
manipulated for a population of teacher education candidates. Effects on
item homogeneity were to be observed.

Subjects. Fifty teacher education candidates enrolled in a curriculum
course were the subjects who generated the test items. These students were
seniors and graduates enrolled in summer session. They were given an
ostensible test writing exercise as one assignment in their course.

Treatments. Four forms of a test item writing exercise sheet were
randomly distributed to the subjects. Each form asked for the production
of eight sample test items, two for each of the following topics: Current
Events, Subtraction, Graphs, and Punctuation Errors. Form one of the exercise
provided an objective stated very generally. For the first topic, the statement
was as follows: "Awareness of the relationship of personalities to current
events." Form two provided a behavioral objective to guide the item writing.
The objective for the Current Events topic was: "To be able to identify people
associated with important current events." Even though considerable clarity
is reflected in this objective, a number of interpretations of it were
obviously possible. For the same topic, Form three again listed the objective,
but, in addition, supplied a sample multiple choice item in which the
current event was stated in the stem and alternatives were the names of
personalities. This condition is identical to the way in which objectives
from the Instructional Objectives Exchange are disseminated, since each
objective is accompanied by a sample item. Form four also included the same
behavioral objective for this topic. In addition, five statements designed
to constrain the type of item produced were provided as follows:



a. The format should be multiple choice.

b. There should be only four alternatives provided.

c. The current event description should appear in the stem of the item;
people's names should form the alternatives.

d. Only one answer should be right for each question; "none of the above"
or "all of the above" should not be alternatives.

e. Current events should be limited to occurrences within the last two
years which probably received front page space in the newspaper. An
example might be space exploration.

The first four statements related only to the format of the item while
the last statement attempted to restrict the content domain from which the
item writer could draw. The sample multiple choice item provided in Form
three was an instance of an item which would fit the description given in
Form four.
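
The format statements of Form four lend themselves to a mechanical check. The following is only an illustrative sketch, not part of the original study: the `Item` record, its field names, and the `format_violations` helper are hypothetical, and statements c and e (stem content and recency of the event) are left to human judgment because they cannot be verified from the item's structure alone.

```python
from dataclasses import dataclass
from typing import List

# Alternatives ruled out by statement d of Form four.
FORBIDDEN = {"none of the above", "all of the above"}


@dataclass
class Item:
    """Hypothetical record for one written test item."""
    stem: str
    alternatives: List[str]
    answer_index: int            # position of the single keyed answer
    fmt: str = "multiple_choice"


def format_violations(item: Item) -> List[str]:
    """Return the Form four format statements (a, b, d) that an item breaks."""
    problems = []
    if item.fmt != "multiple_choice":
        problems.append("a: format is not multiple choice")
    if len(item.alternatives) != 4:
        problems.append("b: item does not have exactly four alternatives")
    if any(alt.strip().lower() in FORBIDDEN for alt in item.alternatives):
        problems.append("d: forbidden alternative used")
    if not 0 <= item.answer_index < len(item.alternatives):
        problems.append("d: no single keyed answer")
    return problems


# Placeholder content only; no real item from the study is reproduced here.
good = Item("Who led event X?", ["Person A", "Person B", "Person C", "Person D"], 2)
bad = Item("Who led event X?", ["Person A", "Person B", "None of the above"], 0)
print(format_violations(good))
print(format_violations(bad))
```

A check of this kind covers only format; whether an item stays within the intended content domain would still require a reader.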



Procedure. The subjects were allowed approximately 90 minutes to produce
the 16 items. Directions were given in each exercise form that the items would
be used for seventh grade students. Subjects were asked to avoid inflated
language, provide necessary test directions, and to supply either the right
answer or criteria for judging each answer, so items could be scored.



Composition of the Test. Items produced were segregated by treatment
and by topic. Two 16-item tests were constituted, each composed of four items
randomly selected from those produced by item writers in each treatment, one
for the topic of subtraction and one for current events. Within each topic
the items were randomly ordered except all constructed responses were grouped
together to minimize the distraction of changing response sets. The topic of
subtraction was selected because performance in that area might simulate that
of an "instructed" group, since subtraction practice has generally been
encountered by most seventh grade students. Current events, however, might
represent an area given less systematic instructional attention. Perhaps
differing levels of competence for the topics might be reflected in the data.
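
The assembly of one 16-item test can be sketched as follows. This is a hedged illustration under stated assumptions: the `compose_test` function, the treatment labels, and the item pool are hypothetical, and placing the constructed-response items last is an assumption (the text says only that they were grouped together).

```python
import random


def compose_test(pool, per_treatment=4, seed=0):
    """Draw items at random from each treatment, then order them.

    pool maps a treatment label to a list of (item_id, is_constructed)
    pairs. Returns a list of (treatment, item_id, is_constructed) rows.
    """
    rng = random.Random(seed)
    chosen = []
    for treatment, items in pool.items():
        # Random selection without replacement within one treatment.
        for item_id, is_constructed in rng.sample(items, per_treatment):
            chosen.append((treatment, item_id, is_constructed))
    rng.shuffle(chosen)  # random order across treatments
    # Stable sort: keeps the shuffled order inside each group while moving
    # constructed-response items together (here, to the end of the test).
    chosen.sort(key=lambda row: row[2])
    return chosen


# Hypothetical pool: ten scorable items per treatment for one topic.
treatments = ["general", "behavioral", "sample_item", "item_form"]
pool = {t: [(f"{t}-{i}", i % 3 == 0) for i in range(10)] for t in treatments}
for row in compose_test(pool, seed=1):
    print(row)
```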

Field Trial. Fifty-one seventh grade students in a Los Angeles junior
high school were administered the 32 items. Children were told that they were
being compared with other seventh grade students in their subtraction and
current events skills and were given one hour to complete all 32 items. Right
answers were read to them by their teacher after the entire test had been
completed.



Data Analysis and Results

Means and standard deviations of the items were computed and are
reported in Table 1. Analysis of variance for the subtest means was conducted
for each replication. Significant differences (F=8.3, df=3, 12) were
observed for the subtraction topic. Items produced under the most constrained
conditions, that is, with a sample test item as a model or the modified item
form, had higher means. The same order effect was observed
in the current events data but the differences were not found to be significant.
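
The reported degrees of freedom (3, 12) are what a one-way analysis of variance gives for four treatment groups of four items each: 4 - 1 = 3 between groups and 16 - 4 = 12 within groups. The sketch below shows that computation under the assumption that item means are the unit of analysis; the `one_way_anova` function is illustrative and the item means are invented, not the values of Table 1.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]
    # Between-groups sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-groups sum of squares: spread of scores around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between, df_within = k - 1, n - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Invented item means (proportion correct), four items per treatment.
hypothetical_item_means = [
    [0.41, 0.52, 0.47, 0.38],   # general objective
    [0.55, 0.49, 0.60, 0.58],   # behavioral objective
    [0.72, 0.68, 0.80, 0.75],   # objective plus sample item
    [0.78, 0.70, 0.83, 0.74],   # objective plus item form statements
]
f_ratio, df_between, df_within = one_way_anova(hypothetical_item_means)
print(f"F = {f_ratio:.2f}, df = {df_between}, {df_within}")
```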

On a common sense basis, one would generally assume that items generated 
under a given treatment condition would correlate better among themselves 
than with subtests produced under different treatment conditions. However, 
an exception might be found for those items produced under the nonbehavioral 
objective condition. Such items might be expected to differ considerably
from one another and might fail to correlate highly with each other or with
any of the other subtests. 

Point biserial correlations were computed for each subtest generated by
the four treatments for both replications (See Tables 2 and 3). The average
correlation of items with their own subtest was compared with the average
correlation of items with each of the other three subtests. Four separate
analyses of variance were conducted for each of the two topics. For the current
events topic, significant differences were found for each of the "constrained"
treatments, that is, items produced with either an objective, test item, or
modified item form as a guide tended to correlate better among themselves
than with items produced by the other treatments. The exception, in current
events, was the analysis conducted on the nonbehavioral subtest. No significant
differences were obtained, and in fact, none of the mean correlations
was above .35. In the subtraction replication, significant differences were
found on each of the analyses of variance conducted. Perhaps because the topic
of subtraction, in itself, provided sufficient structure, the correlations
observed were considerably higher.
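
The point biserial coefficient used here is the correlation between a right/wrong item score and a continuous total. A minimal sketch follows, with invented response data and hypothetical function names. Whether the study removed an item from its own subtest total before correlating is not stated; this sketch leaves the item in, which is an assumption.

```python
from math import sqrt


def point_biserial(item_scores, totals):
    """Point biserial r between 0/1 item scores and subtest totals."""
    n = len(item_scores)
    passed = [t for s, t in zip(item_scores, totals) if s == 1]
    failed = [t for s, t in zip(item_scores, totals) if s == 0]
    if not passed or not failed:
        raise ValueError("item must have both right and wrong responses")
    mean_total = sum(totals) / n
    sd_total = sqrt(sum((t - mean_total) ** 2 for t in totals) / n)
    if sd_total == 0:
        raise ValueError("subtest totals do not vary")
    p = len(passed) / n
    mean_passed = sum(passed) / len(passed)
    mean_failed = sum(failed) / len(failed)
    return (mean_passed - mean_failed) / sd_total * sqrt(p * (1 - p))


def mean_item_subtest_r(responses, item_cols, subtest_cols):
    """Average point biserial of the given items with one subtest total."""
    totals = [sum(row[c] for c in subtest_cols) for row in responses]
    rs = [point_biserial([row[i] for row in responses], totals)
          for i in item_cols]
    return sum(rs) / len(rs)


# Invented data: five students, columns 0-1 form subtest A, columns 2-3 subtest B.
responses = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
]
own = mean_item_subtest_r(responses, [0, 1], [0, 1])
other = mean_item_subtest_r(responses, [0, 1], [2, 3])
print(f"own subtest: {own:.2f}, other subtest: {other:.2f}")
```

Comparing the two averages for each treatment's items mirrors the own-subtest versus other-subtest contrast described above.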



Implications



Modest evidence was found that items produced with some constraints
were more homogeneous than items produced under general conditions for the
current events topic. The different treatments did seem to have predicted
effects in both the replications. The disconfirming evidence, the significant
differences found in the subtraction replication for the nonbehavioral
treatment, might be a function of the precision of the subject matter itself.
When one inspects the mean correlations of the "constrained" treatments, no
particular advantage was found for either the behavioral objective, sample
test item, or modified item form. In the current events replications, the
correlations produced by these treatments are within one point of each other.
In the subtraction replication, they are within four points of each other.

One factor which obviously limits the findings of this study was the
number of items on each subtest. The selection of four items for each
subtest was not divinely inspired. Rather, the number of items selected
was in part determined by the original effects of the treatments. Subjects in
treatment one, writing test items under the "nonbehavioral" condition,
tended to generalize the lack of structure to the extent that only four items
of the 26 produced for the topic of current events were scorable, that is,
included either right answers or means for determining the right answer. One
of these items was in multiple choice format while the other three were
completion items. So the usable items generated by the treatment contained
much more structure than most of the items produced by subjects in that
treatment group. One could expect even more variability than was observed
to be associated with the disparate items which were generated but not
usable, e.g., "Write an essay describing the contribution of a famous 20th
century man." Even fewer usable items were produced on the topics of punctuation
errors and graphs.

Clearly, the study did not produce evidence compelling enough to change
the current method of providing teachers with a sample test item accompanying
each objective. Further studies are underway by the PROBE staff, continuing
to investigate how to produce items truly congruent with objectives and how
one can best translate these findings into practical procedures for teachers.



Table 1. Means and Standard Deviations of Items by Treatment Group